ABSTRACT


Estimating component reliabilities along with the system reliability
frequently requires using lifetimes from the system level. Due to cost and time
constraints, however, the exact cause of system failure may be unknown.
Instead, it may only be ascertained that the cause of failure is due to one
component in a subset of components, e.g., the subset forms a subsystem.
Confronted with such data, this article discusses how to exploit fully the
available information using a maximum likelihood approach. We extend and
clarify the useful work of Miyakawa (1984). A small Monte Carlo study indicates
the helpfulness of this approach.

KEY WORDS: Reliability estimation; Partially masked cause of failure;
Incomplete data; Maximum likelihood estimation; Reliability data bases.
1. INTRODUCTION


Estimating  component  reliabilities  along  with  the  system  reliability 
frequently  requires  using  lifetimes  from  the  system  level.  This  can  arise  from 
field  life  data  or  from  accelerated  or  nonaccelerated  life  testing  data. 
Miyakawa  (1984)  remarked  that,  "...  investigation  of  the  cause  of  failure  is 
expensive  and  requires  time,  and  hence  sometimes  the  cause  of  failure  is  not 
observed,  even  if  the  failure  time  is  observed."  Usher  (1987)  noted,  "...  when 
large  computer  systems  fail  in  the  field,  analysis  is  usually  performed  such 
that  a  small  subset  of  components,  perhaps  a  circuit  card,  is  identified  as  the 
cause  of  failure.  In  an  attempt  to  repair  the  system  as  quickly  as  possible, 
the  entire  subset  of  components  is  replaced  and  the  exact  failing  component  may 
not  be  investigated  further."  See  also  Gross  (1970).  For  related  biological 
data  compare  Dinse  (1986). 

We  use  a  parametric  approach  because  in  engineering  problems  realistic 
parametric  models  are  often  available.  Also,  parametric  estimators  of 
reliability  can  be  much  more  efficient  than  nonparametric  ones. 

We call data for which the cause of failure is narrowed to a subset of
components (e.g., subset {1,2,3}) masked, because the true cause of failure is
partially masked from our knowledge. Note that this can be viewed as a type of
censored data. Here the cause of failure is censored, but the system time may
be complete (i.e., uncensored) or censored. We handle both cases: masked data
with complete system times and masked data with censored system times.

For the first case of masked but time complete data we derive its full
likelihood in Section 2. This full likelihood extends and clarifies Miyakawa's
(1984) useful likelihood. From our full likelihood, we give a helpful partial
likelihood and conditions for its proper statistical use. This provides further
insight and understanding of Miyakawa's (1984) likelihood. In fact, if the
conditions we give are not met then estimators based on Miyakawa's likelihood
will actually be inconsistent (i.e., the estimators will not converge in
probability to the desired, true parameters).

System lifetime censoring, of course, can occur with masked data. We
therefore develop the likelihood approach for that case in Section 3.

Although  these  techniques  were  developed  in  analyzing  the  reliability  of 
such  electronic  products  as  monitors,  graphic  display  terminals,  modems,  point 
of  sale  terminals,  etc.  (see  the  discussion  in  Usher  (1987)),  the  actual  data  is 
unavailable  for  this  article  due  to  proprietary  rights.  In  Section  4  a  small 
Monte  Carlo  study  investigates  the  effects  of  masking  on  the  estimators.  As 
expected,  the  mean  square  error  and  the  bias  get  worse  with  more  masking,  and 
they  improve  with  an  increase  in  sample  size. 

Concluding  remarks  about  analyzing  masked  data  are  presented  in  Section  5. 
We  also  comment  on  using  these  techniques  in  building  better  reliability  data 
bases. 

2.  THE  LIKELIHOOD  WITH  SYSTEM  TIME  COMPLETE  DATA 

We deal with the case of system time complete but masked data in this
Section. We develop the full likelihood for this case when the system is a
series of J components. (A similar development is possible if the system is
parallel, "1-out-of-J," or arbitrary.) The "components" could also be
"modules" in series. See, for example, Barlow and Proschan (1981) for more on
modules.

Consider a sample of n systems. Let $T_i$ be the random life of the ith
system; i=1,...,n. Let $T_{ij}$ be the random life of the jth component in the
ith system; j=1,...,J. Note that

$$ T_i = \min(T_{i1}, \ldots, T_{iJ}) $$

for i=1,...,n. We assume the $T_{ij}$'s are independent. For each fixed j the
$T_{1j}, \ldots, T_{nj}$ represent a random sample from component j's life
distribution $F_j$. We assume $F_j$ has a density (or mass function) $f_j$
indexed by the parameter vector $\theta_j$. For each j a different number of
parameters in $\theta_j$ is allowed if needed. Let
$\bar{F}_j(t) = 1 - F_j(t)$ be the reliability of component j at time t.

To derive precisely the likelihood involving masking, we need the
following notation. Let $K_i$ be the index of the component causing the failure
of system i. (We assume the cause of failure $K_i$ is unique, of course.) Note
that $K_i$ is a random variable. Also, $K_i$ may or may not be observed. That
is, the component causing the system to fail may be masked with other
components in the system.

Before the sample, we are led to the minimum random subset, $M_i$, of
components known to contain the true cause of failure of system i. In short,
$K_i \in M_i$ and $M_i$ is minimum. After the sample data is taken, we observe:

$$ M_i = S_i \subset \{1, 2, \ldots, J\}, \qquad T_i = t_i, \tag{2.1} $$

where i=1,...,n. If $S_i = \{j\}$, then we know $K_i = j$, and hence, the cause
of failure is not masked. If, for example, $S_i = \{1,2\}$, we have
$K_i \in S_i$ but the true value of $K_i$ is masked.


From (2.1) the observed data can be expressed as
$(t_1, S_1), \ldots, (t_n, S_n)$. We now derive the full likelihood for this
data. To make proper probability statements, we let each $f_j$ be a probability
mass function. This also helps develop a reader's intuition for masked data.
The situation of $f_j$ being a density is analogous, of course.

Consider $(t_i, S_i)$ and its contribution $c_i$ to the full likelihood L.

$$
\begin{aligned}
c_i &= P(T_i = t_i,\; M_i = S_i) \\
    &= P\Big[ \bigcup_{j=1}^{J} (T_i = t_i,\; K_i = j,\; M_i = S_i) \Big] \\
    &= \sum_{j \in S_i} P(T_i = t_i,\; K_i = j,\; M_i = S_i) \\
    &= \sum_{j \in S_i} P(T_i = t_i,\; K_i = j) \cdot P(M_i = S_i \mid T_i = t_i,\; K_i = j).
\end{aligned}
$$

The expression $P(M_i = S_i \mid T_i = t_i, K_i = j)$ represents the conditional
probability that the observed minimum random subset is $S_i$, given that system
i failed at time $t_i$, and the true cause was component j. For $S_i = \{j\}$,
the expression is the conditional probability the cause of failure is known.
For $S_i$ containing more than j, it yields the conditional probability of
masking with the set $S_i$.

In  our  industrial  problems,  we  found  that  masking  usually  occurred  due  to 
constraints  of  time  and  expense  of  failure  analysis.  Schedules  often  dictated 
that  complete  failure  analysis  to  determine  the  true  cause  of  system  failure  be 
curtailed. In this setting, we had for $j'$ fixed and $j' \in S_i$

$$ P(M_i = S_i \mid T_i = t_i,\; K_i = j') = P(M_i = S_i \mid T_i = t_i,\; K_i = j) \quad \text{for all } j \in S_i. \tag{2.2} $$


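As a result, this term can be factored out of the summation to yield

$$
\begin{aligned}
c_i &= P(M_i = S_i \mid T_i = t_i,\; K_i = j') \cdot \sum_{j \in S_i} P(T_i = t_i,\; K_i = j) \\
    &= P(M_i = S_i \mid T_i = t_i,\; K_i = j') \sum_{j \in S_i} \Big[ f_j(t_i) \prod_{s=1,\, s \ne j}^{J} \bar{F}_s(t_i) \Big].
\end{aligned}
$$

The full likelihood under (2.2) is then

$$ L = \prod_{i=1}^{n} c_i. $$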


Note that these masking probabilities can be a function of time. Also, we
allow $P(M_i = S_i \mid T_i = t_i, K_i = j') \ne P(M_i = S_i \mid T_i = t_i, K_i = j)$
for $j \notin S_i$. We assume only that the masking probabilities conditional
on the time and cause are not functions of the life parameters. We state this
for future reference as

$$ P(M_i = S_i \mid T_i = t_i,\; K_i = j') \text{ does not depend on the life distribution parameters.} \tag{2.3} $$

This is analogous to a censoring distribution not depending on the life
distribution parameters. Cf. Miller (1981).


Using (2.3), we write a reduced or partial likelihood

$$ L_R = \prod_{i=1}^{n} \sum_{j \in S_i} \Big[ f_j(t_i) \prod_{s=1,\, s \ne j}^{J} \bar{F}_s(t_i) \Big]. \tag{2.4} $$

Under (2.2) and (2.3), maximizing L with respect to the life parameters is
equivalent to using $L_R$. This is similar to the usual derivation of a time
censored (and not masked) data partial likelihood.
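As an illustration only, the reduced likelihood for complete system times (the product over systems of the sum, over components j in the observed subset, of f_j(t_i) times the reliabilities of the other components at t_i) can be evaluated numerically along the following lines. This is a minimal sketch, not the authors' implementation: the function names `log_partial_likelihood` and `exponential_model`, the 0-based component indices, and the example data are all assumptions made for the illustration.

```python
import math

def log_partial_likelihood(data, pdfs, survivals):
    """Log of the reduced likelihood for a series system with complete times.

    data      : list of (t_i, S_i); S_i is a set of 0-based component indices
    pdfs      : list of callables, pdfs[j](t) = f_j(t)
    survivals : list of callables, survivals[j](t) = 1 - F_j(t)
    """
    J = len(pdfs)
    total = 0.0
    for t, S in data:
        contrib = 0.0
        for j in S:
            # f_j(t) times the reliabilities of all other components at t
            term = pdfs[j](t)
            for s in range(J):
                if s != j:
                    term *= survivals[s](t)
            contrib += term
        total += math.log(contrib)
    return total

def exponential_model(rates):
    """Hypothetical helper: exponential f_j and reliability for each rate."""
    pdfs = [lambda t, r=r: r * math.exp(-r * t) for r in rates]
    survivals = [lambda t, r=r: math.exp(-r * t) for r in rates]
    return pdfs, survivals

# Made-up example: one unmasked, one partially masked, one totally masked failure.
data = [(0.5, {0}), (1.2, {0, 1}), (0.3, {0, 1, 2})]
pdfs, survivals = exponential_model([1.0, 2.0, 0.5])
ll = log_partial_likelihood(data, pdfs, survivals)
```

The masking probabilities do not appear in the function: under (2.2) and (2.3) they factor out and do not involve the life parameters, so they can be ignored when maximizing.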

The  above  clarifies  and  extends  Miyakawa's  (1984)  helpful  likelihood.  He 
had  m  systems  with  the  cause  of  failure  known  (not  masked),  while  n-m  were 
masked  (with  only  the  time  of  failure  known).  If  m  is  random  as  in  our 
industrial problems, then his likelihood is really a partial likelihood. (If
for some reason m is a fixed number, our full and partial likelihoods are the
same.)

We suggest it best to view his likelihood (as well as $L_R$ here) as a
partial  likelihood  that  under  the  appropriate  conditions  will  yield  good, 
consistent  estimators.  Without  (2.3),  and  all  else  true,  for  example,  such  a 
partial  likelihood  can  yield  inconsistent  estimators.  (I.e.,  the  estimators 
will  not  converge  in  probability  to  the  desired,  true  parameters.)  For  proper 
statistical  applications,  it  is  important  to  be  clearly  aware  of  the  effects  of 
masking  probabilities  and  needed  conditions. 

3.  THE  LIKELIHOOD  WITH  SYSTEM  TIME  CENSORED  DATA 

Life  testing,  in  general,  can  result  in  censored  system  life  data.  System 
time  censoring  occurred  in  the  masked  data  we  had.  We  present  the  likelihood 
for that case here. Let $Y_i$ be the random censoring time associated with the
ith system. Let $G_i(t) = P(Y_i \le t)$, $\bar{G}_i(t) = 1 - G_i(t)$, and
$g_i(t)$ be the density or probability mass function of $Y_i$. Note that we can
handle each $Y_i$ having a different censoring distribution. We also allow
$Y_i$ to be a fixed number if needed. We have

$$
\delta_i =
\begin{cases}
1 & \text{if } T_i \le Y_i \quad \text{(uncensored)} \\
0 & \text{if } T_i > Y_i \quad \text{(censored)}.
\end{cases}
$$

The reduced (partial) likelihood is

$$
L_R = \prod_{i=1}^{n}
\Big[ \sum_{j \in S_i} f_j(t_i) \prod_{s=1,\, s \ne j}^{J} \bar{F}_s(t_i) \Big]^{\delta_i}
\Big[ \prod_{s=1}^{J} \bar{F}_s(t_i) \Big]^{1 - \delta_i}.
$$

Again, maximizing the full likelihood L with respect to the life parameters
under the stated conditions is equivalent to using $L_R$.

The above likelihood is for randomly right censored data. For other types
of censoring on the system lifetime (e.g., for interval censored), analogous
likelihoods can be derived under appropriate conditions.
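To make the right-censored case concrete, the sketch below evaluates the log of the reduced likelihood when every component is exponential, so that an uncensored system with observed subset S contributes the sum of the rates in S times exp(-Λt), and a censored system contributes exp(-Λt), with Λ the sum of all component rates. The exponential assumption, the function name, the 0/1 censoring indicator in the third tuple position, and the data are assumptions of this illustration rather than part of the original development.

```python
import math

def log_partial_likelihood_censored(data, rates):
    """Log reduced likelihood for masked, right-censored series-system data.

    data  : list of (t_i, S_i, delta_i); delta_i = 1 if the failure was
            observed (S_i is the masked cause set), 0 if the time is censored
            (S_i is ignored)
    rates : exponential failure rates, one per component (illustrative model)
    """
    total_rate = sum(rates)
    ll = 0.0
    for t, S, delta in data:
        if delta == 1:
            # sum over j in S of rate_j * exp(-total_rate * t)
            ll += math.log(sum(rates[j] for j in S)) - total_rate * t
        else:
            # censored: only the system reliability exp(-total_rate * t)
            ll += -total_rate * t
    return ll

# Made-up example: one unmasked failure, one masked failure, one censored unit.
example = [(0.5, {0}, 1), (1.0, {0, 1}, 1), (2.0, set(), 0)]
value = log_partial_likelihood_censored(example, [1.0, 1.0, 1.0])
```

With no censored observations the function reduces to the complete-time reduced likelihood for the same exponential model.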


4.  MONTE  CARLO  STUDY 


We investigate via a small Monte Carlo study the effects of masking on the
bias and mean square error (MSE) of the maximum likelihood estimators (MLE's)
derived from $L_R$.

To cover simply both smaller and larger sample cases, we chose to simulate
samples of size n=10 and 100. Consider a series system of J=3 components (or
modules). Components 1 and 2 form a subsystem. The reliability function is
exponential, $\bar{F}_j(t) = e^{-\lambda_j t}$, for t > 0 and j=1,2,3. For
easier comparisons, we used $\lambda_1 = \lambda_2 = \lambda_3 = 1$. The
exponential random variables were generated using the inverse cumulative
distribution function method; see Kennedy and Gentle (1980).


Simulating the effect of masking, we randomly masked the true cause of
system failure based upon a total masking probability (i.e., proportion). This
total masking proportion was derived from the probabilities of
$M_i = \{1,2,3\}$ and $M_i = \{1,2\}$. This meant we allowed partial masking,
where the cause of failure was known to be in the subsystem {1,2}, or total
masking, {1,2,3}; no other masking was allowed. To satisfy (2.2),
$P(M_i = \{1,2,3\} \mid T_i = t_i, K_i = j)$ was constant for j=1,2,3 and
$P(M_i = \{1,2\} \mid T_i = t_i, K_i = j)$ was constant for j=1,2. Condition
(2.3) was also easily met by assigning the conditional probabilities without a
functional dependence on the life parameters, $\lambda_1$, $\lambda_2$,
$\lambda_3$.
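A simulation of this kind might be sketched as follows. The original study was written in FORTRAN, and the paper does not say how the total masking proportion was split between the two masked sets, so the parameters `p_total` and `p_partial`, the function name, and the 0-based component labels are assumptions made purely for illustration. The masking probabilities here are constants that do not vary with the cause inside each masked set, in the spirit of conditions (2.2) and (2.3).

```python
import math
import random

def simulate_masked_sample(n, rates, p_total, p_partial, rng):
    """Simulate n three-component series systems with masked causes.

    p_total   : P(M = {0,1,2} | cause j), the same for every cause j
    p_partial : P(M = {0,1} | cause j), the same for causes j = 0 and 1
    Requires p_total + p_partial <= 1. Returns a list of (t_i, S_i).
    """
    sample = []
    for _ in range(n):
        # Inverse-CDF draw: T = -log(1 - U) / rate, U uniform on [0, 1)
        lifetimes = [-math.log(1.0 - rng.random()) / r for r in rates]
        t = min(lifetimes)
        cause = lifetimes.index(t)
        u = rng.random()
        if u < p_total:
            observed = frozenset({0, 1, 2})   # totally masked
        elif cause in (0, 1) and u < p_total + p_partial:
            observed = frozenset({0, 1})      # masked within the subsystem
        else:
            observed = frozenset({cause})     # cause identified
        sample.append((t, observed))
    return sample

rng = random.Random(12345)
sample = simulate_masked_sample(10, [1.0, 1.0, 1.0], 0.2, 0.3, rng)
```

Under this particular scheme a failure caused by component 3 is masked with probability `p_total`, while one caused by component 1 or 2 is masked with probability `p_total + p_partial`.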

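Let $n_1$, $n_2$ and $n_3$ denote the number of failures where $M_i = \{1\}$,
$M_i = \{2\}$, and $M_i = \{3\}$ respectively. Let $n_{12}$ denote the count of
$M_i = \{1,2\}$, the number of partially masked failures. Let $n_{123}$
represent the number of totally masked, $M_i = \{1,2,3\}$, systems. The MLE's
are found by using (2.4) to be

$$
\hat{\lambda}_1 = \Big[ n_1 + n_{12} \Big( \frac{n_1}{n_1 + n_2} \Big)
  + n_{123} \Big( \frac{n_1 k}{n_1 k + n_2 k + n_3} \Big) \Big] \Big/ \sum_{i=1}^{n} t_i ,
$$

$$
\hat{\lambda}_2 = \Big[ n_2 + n_{12} \Big( \frac{n_2}{n_1 + n_2} \Big)
  + n_{123} \Big( \frac{n_2 k}{n_1 k + n_2 k + n_3} \Big) \Big] \Big/ \sum_{i=1}^{n} t_i ,
$$

$$
\hat{\lambda}_3 = \Big[ n_3 + n_{123} \Big( \frac{n_3}{n_1 k + n_2 k + n_3} \Big) \Big] \Big/ \sum_{i=1}^{n} t_i ,
$$

where

$$ k = 1 + \frac{n_{12}}{n_1 + n_2}. $$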

With no masking these estimators reduce to the standard MLE's, with the
number of failures in the numerator divided by the total time on test
observed. Note that with masking the numerator can be interpreted as an
"estimate" of the number of failures caused by a particular component.
Consider, for example, $n_{12}$. How many of the $n_{12}$ should we allocate as
failures due to component 1? It is natural to consider an empirical allocation
of

$$ n_{12} \Big( \frac{n_1}{n_1 + n_2} \Big). $$

Similar comments apply to other terms in the numerators.

It is critical to note that these estimators are undefined when
$n_1 + n_2 = 0$, i.e., no known causes of failures for components 1 or 2 are
observed. We, therefore, restricted our study to consider only samples where
$n_1 + n_2 > 0$. This condition also assured us of not allowing a sample with
all the data masked. (Note this is analogous to censored data Monte Carlo
studies where a sample with all censored observations is excluded.)
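The closed-form estimators for the three-component exponential case can be transcribed directly. The following is a hedged sketch: the function name and argument order are assumptions of this illustration, `total_time` stands for the sum of the observed system lifetimes, and the example counts are made up. The undefined case with no identified failures of components 1 or 2 is rejected explicitly.

```python
def masked_exponential_mle(n1, n2, n3, n12, n123, total_time):
    """Closed-form MLE's for the J=3 exponential series system with masking.

    n1, n2, n3 : counts of failures with identified cause 1, 2, 3
    n12        : count of failures masked to the subsystem {1,2}
    n123       : count of totally masked failures {1,2,3}
    total_time : sum of the observed system lifetimes
    """
    if n1 + n2 == 0:
        raise ValueError("estimators are undefined when n1 + n2 == 0")
    k = 1.0 + n12 / (n1 + n2)
    denom = n1 * k + n2 * k + n3
    # Each numerator "estimates" the number of failures due to that component.
    lam1 = (n1 + n12 * n1 / (n1 + n2) + n123 * n1 * k / denom) / total_time
    lam2 = (n2 + n12 * n2 / (n1 + n2) + n123 * n2 * k / denom) / total_time
    lam3 = (n3 + n123 * n3 / denom) / total_time
    return lam1, lam2, lam3

# Made-up counts: 6 identified, 4 partially masked, 5 totally masked failures.
estimates = masked_exponential_mle(2, 3, 1, 4, 5, 10.0)
```

One simple check on the transcription is that the three estimates sum to the total number of failures divided by the total time on test, since every masked failure is allocated to exactly one component.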

The entire simulation was programmed in FORTRAN and run on a VAX 11/750.
The results are based on 100,000 repeated samples. This number of replicates
was needed for very large masking proportions to assure an adequate number of
samples not having every system masked. This difficulty also led us to
simulate masking proportions only up to 95%. The results are graphed in Figures
1 and 2.

As  expected,  the  bias  and  MSE  get  worse  with  increased  masking.  They 
improve  with  an  increase  in  sample  size  from  10  to  100.  For  no  masking  (i.e., 
the  masking  proportion  is  0)  the  three  estimators  are  the  standard  MLE's  with 
bias $1/(n-1)$. In Figure 1 we have graphed the baseline bias for each n for
reference.  For  Figure  2  the  reader  could  draw  in  the  similar  baseline  MSE's  by 
using  the  value  at  proportion  0.  Note  that  the  bias  and  MSE  are  fairly  well 
behaved  in  spite  of  masking  as  large  as  50%  or  60%  (even  for  n=10).  For  most 
industrial  problems  masking  would  be  smaller  than  that. 
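For reference, the baseline bias of 1/(n-1) with no masking can be recovered by a short calculation. The sketch below uses two standard facts about independent exponential components in series: the system lifetime is exponential with rate $\Lambda$, the sum of the component rates, and the cause of failure is independent of the system lifetime. It ignores the small effect of restricting the study to samples with at least one identified failure of component 1 or 2. The symbol $\Lambda$ is introduced here for the illustration.

```latex
\begin{align*}
\hat{\lambda}_j &= \frac{n_j}{\sum_{i=1}^{n} T_i}, \qquad
  n_j \sim \mathrm{Binomial}\!\left(n, \frac{\lambda_j}{\Lambda}\right), \qquad
  \sum_{i=1}^{n} T_i \sim \mathrm{Gamma}(n, \Lambda), \qquad
  \Lambda = \lambda_1 + \lambda_2 + \lambda_3, \\
E\!\left[\frac{1}{\sum_{i=1}^{n} T_i}\right] &= \frac{\Lambda}{n-1} \quad (n > 1), \\
E[\hat{\lambda}_j] &= n \, \frac{\lambda_j}{\Lambda} \cdot \frac{\Lambda}{n-1}
  = \frac{n}{n-1} \, \lambda_j, \\
E[\hat{\lambda}_j] - \lambda_j &= \frac{\lambda_j}{n-1} = \frac{1}{n-1}
  \quad \text{when } \lambda_j = 1.
\end{align*}
```

For the two sample sizes in the study this gives baseline biases of 1/9 for n=10 and 1/99 for n=100.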

Components 1 and 2 have bias and MSE that track each other as would be
anticipated since they form a subsystem. Note that for rather large masking
proportions, however, they become even more positively biased. To compensate
for this, component 3 becomes strongly negatively biased. It seems for very
heavy masking that the numerator in $\hat{\lambda}_1$ (and $\hat{\lambda}_2$)
may over assign masked failures as due to component 1 (and 2). (Recall the
earlier comment that the numerator is "estimating" the number of failures due
to that component.) These aspects might motivate the search for modified
estimators when the masking is very heavy. For n=100, however, the bias and MSE
do rather well even up to 80% or 90% masking. From our industrial reliability
analyses, we found the MLE's using $L_R$ to be very reasonable.


5. CONCLUDING REMARKS

We  have  presented  an  approach  which  has  actually  been  applied  and  found 
useful  in  a  real  world  setting.  Unfortunately,  the  data  is  not  available  due  to 
proprietary  rights.  We  had  these  problems  arise  in  system  life  testing.  We 
feel,  however,  the  approach  could  also  be  useful  in  building  better  reliability 
data  bases  on  components  using  field  systems  data  (as  well  as  life  testing  data 
on systems). Cf. the insightful paper by Doss, Freitag, and Proschan (1985)
where  they  use  complete  (not  masked)  system  lifelengths  to  estimate 
nonparametrically  component  reliabilities  (some  component  lifetimes  may  be 
censored,  while  others  could  be  complete). 

In  analyzing  masked  data,  it  is  important  to  understand  the  mechanism 
causing  the  masking.  If  masking  probabilities  and  conditions  such  as  (2.2)  and 
(2.3)  are  overlooked,  estimators  could  turn  out  to  be  inconsistent.  With 
careful  attention  to  masking,  valuable  information  can  be  incorporated  properly 
for  statistical  analyses  of  key  industrial  devices. 

When,  for  example,  condition  (2.2)  is  not  true,  how  could  a  likelihood 
approach  be  developed?  We  are  currently  working  on  a  modified  method  based  on 
the EM algorithm (see, e.g., Cox and Oakes (1984)) to accomplish that.

Finally,  a  likelihood  development  suggests  building  a  Bayesian  framework 
for  analyzing  masked  data  and  the  true  cause  of  failure.  We  are  also  exploring 
this  construction. 